(54) Method for processing audio-signals

(57) The invention regards a method for processing
audio-signals whereby audio signals are captured at two
spaced apart locations and subject to a transformation
in the perceptual domain (Bark or Mel), whereupon:

a. a (blind or supervised) source separation process
is performed to give a first estimate of the wanted
signal parts and the noise parts of the microphone
signals and

b. a coherence based separation process is performed
to give a second estimate of the wanted signal
parts and the noise parts of the microphone signals,

and where further a sound field diffuseness
detection is performed on the at least two signals,
whereby further the sound field diffuseness detection
is used to mix the output from the blind source separation
and the coherence based separation process in order
to achieve the best possible signal. The transfer
functions calculated from the source separation are
used to reconstruct a virtual stereophonic sound field and
to restore the spatial information about the source position
in the enhanced signals.
Description

AREA OF THE INVENTION 

[0001] The invention is related to the area of speech enhancement of audio signals, and more specifically to a method
for processing audio signals in order to enhance speech components of the signal whenever they are present. Such
methods are particularly applicable to hearing aids, where they allow the hearing impaired person to communicate
better with other people.

BACKGROUND OF THE INVENTION 

[0002] The problem of extracting a signal of interest from noisy observations is well known by acoustics engineers. 
Especially, users of portable speech processing systems often encounter the problem of interfering noise reducing the 
quality and intelligibility of speech. To reduce these harmful noise contributions, several single-channel speech
enhancement algorithms have been developed [1-4]. Nonetheless, even though single-channel algorithms are able to
improve signal quality, recent studies have reported that they are still unable to improve speech intelligibility [5]. In
contrast, multiple-microphone noise reduction schemes have been shown repeatedly to increase speech intelligibility
and quality [6,7].

[0003] Multiple microphone speech enhancement algorithms can be roughly classified into quasi-stationary spatial
filtering and time-variant envelope filtering [8]. Quasi-stationary spatial filtering exploits the spatial configuration of the
sound sources to reduce noise by means of a spatial filter. The filter characteristics do not change with the dynamics of
speech but with the slower changes in the spatial configuration of the sound sources. Such filters achieve almost
artefact-free speech enhancement in simple, low-reverberation environments and computer simulations. Typical examples are adaptive
noise cancelling, positive and differential beam-forming [30] and blind source separation [28,29]. The most promising
algorithms of this class proposed hitherto are based on blind source separation (BSS). BSS is the sole technique
which aims to estimate an exact model of the acoustic environment and possibly to invert it. It includes the model for
de-mixing of a number of acoustic sources from an equal number of spatially diverse recordings. Additionally, multi-path
propagation through reverberation is also included in BSS models. The basic problem of BSS consists in recovering
hidden source signals using only their linear mixtures and nothing else. Assume $d_s$ statistically independent sources
$s(t) = [s_1(t), \ldots, s_{d_s}(t)]^T$. These sources are convolved and mixed in a linear medium leading to $d_x$ sensor signals
$x(t) = [x_1(t), \ldots, x_{d_x}(t)]^T$ that may include additional noise:

$$x(t) = \sum_{\tau=0}^{\infty} G(\tau)\, s(t-\tau) + n(t) \qquad (1)$$

The aim of source separation is to identify the multiple channel transfer characteristics $G(\tau)$, possibly to invert them, and
to obtain estimates of the hidden sources given by:

$$u(t) = \sum_{\tau=0}^{\infty} W(\tau)\, x(t-\tau) \qquad (2)$$

where $W(\tau)$ is the estimated inverse of the multiple channel transfer characteristics $G(\tau)$. Numerous algorithms have been
proposed for the estimation of the inverse model $W(\tau)$. They are mainly based on exploiting the assumption
of statistical independence of the hidden source signals. The statistical independence can be exploited in different
ways and additional constraints can be introduced, such as intrinsic correlations or non-stationarity of
source signals and/or noise. As a result, a large number of BSS algorithms in various implementation forms (e.g.
time domain, frequency domain and time-frequency domain) have been proposed recently for multiple-channel speech
enhancement (see for example [28,29]).
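
The mixing model x(t) = sum over tau of G(tau) s(t - tau) + n(t) and the demixing model u(t) = sum over tau of W(tau) x(t - tau) can be illustrated with a minimal NumPy sketch. The function name `fir_matrix_filter`, the finite number of taps, the noise-free single-tap mixing matrix and the use of a known inverse are all assumptions made for illustration; in practice a BSS algorithm would have to estimate W from the sensor signals alone.

```python
import numpy as np

def fir_matrix_filter(H, v):
    """Apply a matrix of FIR filters: y(t) = sum_tau H[tau] @ v(t - tau).

    H has shape (n_taps, n_out, n_in); v has shape (n_in, n_samples).
    Samples before t = 0 are taken to be zero.
    """
    n_taps, n_out, _ = H.shape
    n_samples = v.shape[1]
    y = np.zeros((n_out, n_samples))
    for tau in range(min(n_taps, n_samples)):
        # Contribution of the input delayed by tau samples.
        y[:, tau:] += H[tau] @ v[:, :n_samples - tau]
    return y

rng = np.random.default_rng(0)
s = rng.standard_normal((2, 1000))   # two hidden sources (illustrative data)

# Hypothetical single-tap mixing system, i.e. no reverberation.
G = np.array([[[1.0, 0.5],
               [0.3, 1.0]]])
x = fir_matrix_filter(G, s)          # sensor signals, noise-free

# With G known, the ideal demixing system is simply its inverse.
W = np.linalg.inv(G[0])[np.newaxis, :, :]
u = fir_matrix_filter(W, x)          # estimates of the hidden sources

print(np.allclose(u, s))
```

With reverberation, G has many taps and its inverse is no longer a single matrix, so the number of parameters to be identified grows accordingly.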

[0004] Dogan and Stearns [9] use cumulant based source separation to enhance the signal of interest in binaural
hearing aids. Rosca et al. [10] apply blind source separation for demixing delayed and convolved sources from the
signals of a microphone array. A post-processing step is proposed to improve the enhancement. Jourjine et al. [11] use the
statistical distribution of the signals (estimated using histograms) to separate speech and noise. Balan et al. [2] propose
autoregressive (AR) modelling to separate sources from a degenerate mixture. Several approaches use the spatial
information given by a plurality of microphones using beamformers. Koroljow and Gibian [12] use first and second order
beamformers to adapt the directivity of the hearing aids to the noise conditions.

[0005] Bhadkamkar and Ngo [3] combine a negative beamformer to extract the speech source with a post-processing stage
to remove the reverberation and echoes. Lindemann [13] uses a beamformer to extract the energy from the speech
source and an omni-directional microphone to obtain the whole energy from the speech and noise sources. The ratio
between these two energies allows the speech signal to be enhanced by spectral weighting. Feng et al. [14] reconstruct
the enhanced signal using delayed versions of the signals of a binaural hearing aid system.
[0006] BSS techniques have been shown to achieve almost artefact-free speech enhancement in simple, low-reverberation
environments, laboratory studies and computer simulations, but they perform poorly for recordings in reverberant
environments and/or with diffuse noise. One could speculate that in reverberant environments the number of model
parameters becomes too large to be identified accurately in noisy, non-stationary conditions.

[0007] In contrast, envelope filtering approaches (e.g. Wiener, DCT-Bark, coherence and directional filtering) do not exhibit such
failures since they use a simple statistical description of the acoustical environment or of the binaural interaction in the
human auditory system [8]. Such algorithms process the signal in an appropriate dual domain. The envelope of the
target signal or, equivalently, a short-time weighting index (short-time signal-to-noise ratio (SNR), coherence) is estimated
in several frequency bands. The target is assumed to be of frontal incidence and the enhanced signal is obtained
by modulating the spectral envelope of the noisy signal by the estimated short-time weighting index. The adaptation
of the weighting index has a temporal resolution of about the syllable rate. Dual channel approaches based on the
statistical description of the sources using the coherence function have been presented [1,15-17]. Further improvements
have been obtained by merging spatial coherence of noisy sound fields, masking properties of the human auditory
system and subspace approaches [19].

[0008] Multi-channel speech enhancement algorithms based on envelope filtering are particularly appropriate for
complex acoustic environments, namely those with diffuse noise and high reverberation. Nevertheless, they are unable to
provide loss-less or artefact-free enhancement. Globally, they reduce noise contributions in the time-frequency regions
without any speech contributions. In contrast, in time-frequency regions with speech contributions, the noise cannot
be reduced and distortions can be introduced. This is the main reason why envelope filtering might help reduce the
listening effort in noisy environments while intelligibility improvement is generally lacking [20].

[0009] The above considerations point out that the performance of multiple channel speech enhancement algorithms
depends essentially on the complexity of the acoustical context. A given algorithm is appropriate for a specific acoustic
environment, and in order to cope with changing properties of the acoustic environment composite algorithms have
been proposed more recently.

[0010] The approach proposed by Melanson and Lindemann in [21] consists in manual switching between different
algorithms to enhance speech under various conditions. Manual switching between several combinations of filtering
and dynamic compression has also been proposed by Lindemann et al. [22].

[0011] More advanced techniques using automatic switching according to different noise conditions have been
proposed by Killion et al. in [23]. The input of the hearing aid is switched automatically between omnidirectional and
directional microphones.

[0012] A strategy selective algorithm has been described by Wittkop [24]. This algorithm uses an envelope filtering
based on a generalized Wiener approach and an envelope filtering invoking directional inter-aural level and phase
differences. A coherence measure is used to identify the acoustical situations and gradually switch off the directional
filtering with increasing complexity. It is pointed out that this algorithm helps reduce the listening effort in noisy
environments but that intelligibility improvement is still lacking.

[0013] Therefore, it is the aim of the present invention to provide a composite method including source separation 
and coherence based envelope filtering. Source separation and coherence based envelope filtering are achieved in 
the time Bark domain, i.e. in specific frequency bands. Source separation is performed in bands where coherent sound 
fields of the signal of interest or of a predominant noise source are detected. Coherence based envelope filtering acts 
in bands where the sound fields are diffuse and/or where the complexity of the acoustic environment is too large.
Source separation and coherence based envelope filtering may act in parallel and are activated in a smooth way 
through a coherence measure in the Bark bands. 

[0014] It is a further aim of the present invention to provide a real binaural enhancement of the observed sound
field by using the multiple channel transfer characteristics identified by source separation. Indeed, speech
enhancement algorithms commonly achieve mainly a monaural speech enhancement, which implies that users of such devices
lose the ability to localize sources. A promising solution, which could achieve real binaural speech enhancement,
consists of a device with one or two microphones in each ear and an RF-link in between. The benefit for the user would
be enormous. Notably, it has been reported that binaural hearing increases the loudness and signal-to-noise ratio of
the perceived sound, improves intelligibility and quality of speech and allows the localization of sources, which is of
prime importance in situations of danger. Lindemann and Melanson [25] propose a system with wireless transmission
between the hearing aids and a processing unit worn at the belt of the user. Brander [7] similarly proposes a direct
communication between the two ear devices. Goldberg et al. [26] combine the transmission and the enhancement.
Finally, optical transmission via glasses has been proposed by Martin [27]. Nevertheless, in none of these approaches
has a virtual reconstruction of the binaural sound field been proposed. The approach proposed herein, namely exploitation
of the multiple channel transfer characteristics identified by source separation to reconstruct the real sound field
and attenuate noise contributions, considerably improves the security and the comfort of the listener.
SUMMARY OF THE INVENTION

[0015] The invention comprises a method for processing audio-signals whereby audio signals are captured at two
spaced apart locations and subject to a transformation in the perceptual domain (Bark or Mel decomposition), whereupon
the enhancement of the speech signal is based on the combination of parametric (model based) and non-parametric
(statistical) speech enhancement approaches:

a. a source separation process is performed to give a first estimate of the wanted signal parts and the noise parts 
of the microphone signals and 

b. a coherence based envelope filtering is performed to give a second estimate of the wanted signal parts of the 
microphone signals, 

and where further a sound field diffuseness detection is performed on the at least two signals, whereby further the
sound field diffuseness detection is used to mix the outputs of the source separation process and of the coherence based
envelope filtering in order to achieve the best possible signal. The transfer functions estimated by the source separation
algorithms are used to reconstruct a virtual stereophonic sound field (spatial localisation of the different sound sources).

[0016] When the speech and noise sources are in the direct sound field (the direct path between sound sources and
microphones is dominant, reverberation is low), the transmission transfer function from each source in each source-ear
system can be estimated and used to separate speech and noise signals by the use of source separation. These
transfer functions are estimated using source separation algorithms. The learning of the coefficients of the transfer
functions can be either supervised (when only the noise source is active) or blind (when speech and noise sources
are active simultaneously). The learning rate in each frequency band can be dependent on the signal characteristics.
The signal obtained with this approach is the first estimate of the clean speech signal.

[0017] When the noise signal is in the reverberant sound field (contributions from reverberation are comparable to
those of the direct path), source separation approaches fail due to the complexity of the transfer functions to be
evaluated. A statistically based envelope filtering can be used to extract speech from noise. The short-time coherence
function calculated in the transform domain (Bark or Mel) allows a probability of presence of speech to be estimated in each
Bark or Mel frequency band. Applying it to the noisy speech signal makes it possible to extract the bands where speech is
dominant and attenuate those where noise is dominant. The signal obtained with this approach is the second estimate of the
clean speech signal.
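
The coherence based envelope filtering described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of the invention: the magnitude-coherence formula, the recursive averaging constant `alpha`, the function names and the direct use of the coherence as a band gain are all hypothetical choices, and the inputs are assumed to be complex band-domain coefficients of shape (bands, frames).

```python
import numpy as np

def short_time_coherence(XL, XR, alpha=0.8, eps=1e-12):
    """Magnitude coherence per band and frame, in [0, 1].

    XL, XR: complex arrays of shape (n_bands, n_frames) holding the
    band-domain coefficients of the left and right microphone signals.
    alpha is an assumed forgetting factor for the recursive averages.
    """
    n_bands, n_frames = XL.shape
    p_ll = np.zeros(n_bands)                 # auto-spectrum, left
    p_rr = np.zeros(n_bands)                 # auto-spectrum, right
    p_lr = np.zeros(n_bands, dtype=complex)  # cross-spectrum
    coh = np.zeros((n_bands, n_frames))
    for m in range(n_frames):
        p_ll = alpha * p_ll + (1 - alpha) * np.abs(XL[:, m]) ** 2
        p_rr = alpha * p_rr + (1 - alpha) * np.abs(XR[:, m]) ** 2
        p_lr = alpha * p_lr + (1 - alpha) * XL[:, m] * np.conj(XR[:, m])
        coh[:, m] = np.abs(p_lr) / np.sqrt(p_ll * p_rr + eps)
    return coh

def coherence_filter(XL, XR, alpha=0.8):
    """Weight each band of both channels by the short-time coherence."""
    gain = short_time_coherence(XL, XR, alpha=alpha)
    return gain * XL, gain * XR
```

Because the first frame averages a single observation, its coherence is trivially close to one; a practical implementation would likely discard or initialise the first few frames differently.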

[0018] These two estimates of the clean speech signal are then mixed to optimise the performance of the enhancement.
The mixing is performed independently in each frequency band, depending on the sound field characteristic of
each frequency band. The respective weight for each approach and for each frequency band is calculated from the
coherence function.
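
A per-band mixing of the two estimates might look like the sketch below. The text only states that the weights are calculated from the coherence function; using the coherence itself as the weight of the source separation estimate, and one minus the coherence for the envelope filtering estimate, is an assumption made here for illustration, as is the function name `mix_estimates`.

```python
import numpy as np

def mix_estimates(u_separation, u_coherence, coherence):
    """Convex combination of two speech estimates, per band and frame.

    u_separation: estimate from source separation, shape (n_bands, n_frames)
    u_coherence:  estimate from coherence based envelope filtering, same shape
    coherence:    short-time coherence in [0, 1], same shape
    High coherence (direct sound field) favours the source separation
    estimate; low coherence (diffuse field) favours envelope filtering.
    """
    w = np.clip(coherence, 0.0, 1.0)
    return w * u_separation + (1.0 - w) * u_coherence
```

With this choice, a band whose coherence is exactly one passes the source separation output unchanged, and a band whose coherence is zero passes the envelope filtered output unchanged.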

[0019] During the combination of the signals calculated from the two approaches, the transfer functions estimated
by source separation are used to reconstruct a virtual stereophonic sound field and to recover the spatial information
from the different sources.
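
One way to picture the reconstruction of the virtual stereophonic sound field is to filter the enhanced signal with the estimated source-to-ear responses. The sketch below assumes time-domain FIR responses `h_left` and `h_right` for a single source, which is a simplification; the function name and the truncation to the input length are likewise illustrative assumptions.

```python
import numpy as np

def respatialise(u, h_left, h_right):
    """Re-impose spatial cues on an enhanced mono signal.

    u:        enhanced source signal, shape (n_samples,)
    h_left:   assumed estimated impulse response from the source to the left ear
    h_right:  assumed estimated impulse response from the source to the right ear
    Returns an array of shape (2, n_samples) with left and right channels.
    """
    n = len(u)
    left = np.convolve(u, h_left)[:n]
    right = np.convolve(u, h_right)[:n]
    return np.vstack([left, right])

# Example: the right ear receives the source one sample later and attenuated.
stereo = respatialise(np.array([1.0, 2.0, 3.0]), [1.0], [0.0, 0.5])
```

The inter-aural delay and level difference carried by the two responses are what allow the listener to localise the source in the enhanced signal.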

[0020] In a further embodiment of the invention the sound field diffuseness detection is based on the value of a short- 
time coherence function where the coherence function is expressed as: 



iw-)- . ♦sag 

[0021] This function varies between zero and one, according to the amount of "coherent" signal. When the speech
signal dominates the frequency band, the coherence is close to one and when there is no speech in the frequency
band, the coherence is close to zero. Once the diffuseness of the sound field is known, the results of the source
separation and of the coherence based approach can be combined optimally to enhance the speech signals. The
combination can be the use of one of the approaches alone when the noise source is totally in the direct sound field or totally
in the diffuse sound field, or a combination of the results when some of the frequency bands are in the direct sound
field and others are in the diffuse sound field.
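
For orientation, a commonly used form of the short-time coherence between two channels is given below. It is offered as an assumption and not necessarily the exact expression intended in this document; the symbols (band index b, frame index m, smoothed auto-spectra P_ll and P_rr, smoothed cross-spectrum P_lr, band coefficients X_l and X_r, forgetting factor alpha) are all introduced here for illustration.

```latex
\Gamma_{lr}(b, m) = \frac{\left| P_{lr}(b, m) \right|}{\sqrt{P_{ll}(b, m)\, P_{rr}(b, m)}},
\qquad
P_{lr}(b, m) = \alpha\, P_{lr}(b, m-1) + (1 - \alpha)\, X_l(b, m)\, X_r^{*}(b, m)
```

With P_ll and P_rr averaged in the same way, the Cauchy-Schwarz inequality gives |P_lr|^2 <= P_ll P_rr, so this quantity lies between zero and one, which matches the range stated above.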
